1 Introduction

Team 010100 consists of the following members: Izzy Illari, Lucia Illari, Omar Qusous, and Lydia Teinfalt. You can find our work on GitHub.

For the second portion of our group project, we kept the Olympics data from the EDA. Our SMART questions were: What factors can be used to model the probability of being awarded a medal? What groups/clusters do athletes of different sports fall into? How does a pandemic affect the medals awarded? How can the evolution of athlete characteristics over time be modelled? With these questions in mind, we set out to see whether the data on Olympians could be used to find patterns and build models that answer them.

We used the Kaggle dataset 120 years of Olympic history: athletes and results (https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results). This historical dataset covers all Olympic Games from Athens 1896 to Rio 2016 and was scraped from https://www.sports-reference.com/. We focused on data from the 1960-2016 Olympic Games for the clustering (Kmeans and Kmedoids), linear and logit regression, and trends-over-time analyses. For the pandemic analysis, we focused on data from Olympic Games held before and after the 1918-1919 H1N1 (Spanish Flu) pandemic.

The report is organized as follows:

  1. Summary of Dataset
  2. Data Prep
  3. EDA
  4. Clustering, Kmeans, Kmedoids
  5. Linear and Logit Regression
  6. Random Forest
  7. Pandemic (Spanish Flu)
  8. Trends over time
  9. Summary and Conclusion
  10. References

2 Summary of Dataset

The data looks like the following:

'data.frame':   151977 obs. of  24 variables:
 $ NOC         : Factor w/ 122 levels "AFG","ALB","AND",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Year        : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ Decade      : Factor w/ 6 levels "1960s","1970s",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ID          : Factor w/ 74771 levels "1","2","6","7",..: 32644 32479 60453 32515 70738 16344 21919 59125 70738 32153 ...
 $ First.Name  : Factor w/ 14118 levels "","A","A.","Aadam",..: 8716 3731 64 599 64 11978 64 4634 64 8716 ...
 $ Name        : Factor w/ 74268 levels "  Gabrielle Marie \"Gabby\" Adcock (White-)",..: 48941 19066 219 3341 221 64833 216 23793 221 48946 ...
 $ Last.Name   : Factor w/ 47370 levels "","-)","-Alard)",..: 23228 23112 38893 23137 44908 13260 16633 37860 44908 22890 ...
 $ Sex         : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
 $ Age         : int  24 18 20 35 20 28 22 23 20 20 ...
 $ Height      : int  171 162 178 166 179 168 172 170 179 166 ...
 $ Weight      : num  78 52 68 66 75 73 70 58 75 62 ...
 $ BMI         : num  26.7 19.8 21.5 24 23.4 ...
 $ BMI.Category: Factor w/ 5 levels "0","1","2","3",..: 4 1 3 3 3 4 3 3 3 3 ...
 $ Team        : Factor w/ 332 levels "Acipactli","Afghanistan",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Population  : int  8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 ...
 $ GDP         : num  5.38e+08 5.38e+08 5.38e+08 5.38e+08 5.38e+08 ...
 $ GDPpC       : num  59.8 59.8 59.8 59.8 59.8 ...
 $ Games       : Factor w/ 30 levels "1960 Summer",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Season      : Factor w/ 2 levels "Summer","Winter": 1 1 1 1 1 1 1 1 1 1 ...
 $ City        : Factor w/ 29 levels "Albertville",..: 19 19 19 19 19 19 19 19 19 19 ...
 $ Sport       : Factor w/ 51 levels "Alpine Skiing",..: 51 51 3 51 3 51 3 3 3 51 ...
 $ Event       : Factor w/ 489 levels "Alpine Skiing Men's Combined",..: 478 468 17 476 33 482 22 24 18 466 ...
 $ Medal       : Factor w/ 4 levels "Bronze","Gold",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Medal.No.Yes: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

The athlete events data has 24 columns and 151977 rows/entries, for a total of 3647448 individual data points. In olympic_data each row corresponds to an individual athlete competing in an individual Olympic event. The variables are the following:

  1. ID: Unique number for each athlete
  2. Name: Athlete’s name
  3. Sex: M or F
  4. Age: Integer
  5. Height: centimeters
  6. Weight: kilograms
  7. Team: Team name
  8. NOC: National Olympic Committee 3-letter code
  9. Games: Year and season
  10. Year: Integer
  11. Season: Summer or Winter
  12. City: Host city
  13. Sport
  14. Event
  15. Medal: Gold, Silver, Bronze, or NA

To prepare our data for EDA we dropped the Olympic event Art Sculpting, and rows with NAs were removed. We modified the data from the Kaggle dataset from which it was originally taken. The dataset now starts at 1960 and includes the following new variables:

  1. Decade (factor)
  2. First name (factor)
  3. Last name (factor)
  4. BMI (numeric)
  5. BMI category (factor)
  6. Population (numeric)
  7. GDP (numeric)
  8. GDPpC (numeric)
  9. Medal: Yes or No (factor)
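Two of the added numeric columns can be derived directly from the originals: BMI from Weight and Height, and GDPpC as GDP divided by Population. A quick Python sketch of the derivations (the project itself is in R; the function names here are ours), checked against the first row of the data frame shown above:

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / (height_cm / 100) ** 2

def gdp_per_capita(gdp, population):
    """GDPpC: total GDP divided by population."""
    return gdp / population

# First row above: Weight 78, Height 171, GDP 5.38e8, Population 8996973
print(round(bmi(78, 171), 1))                     # 26.7, matching the BMI column
print(round(gdp_per_capita(5.38e8, 8996973), 1))  # 59.8, matching the GDPpC column
```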

3 EDA

For the EDA, we start with a quick summary of the data.

Table: Statistics summary (numeric variables).

          Age    Height  Weight    BMI   Population        GDP      GDPpC
Min.    12.00     127.0   28.00  10.50    9.891e+03  3.029e+07       59.5
1st Qu. 21.00     168.0   60.00  20.90    1.028e+07  6.500e+10     2759.0
Median  24.00     175.0   70.00  22.50    4.208e+07  3.300e+11    10586.0
Mean    25.14     175.3   70.47  22.73    1.132e+08  1.490e+12    16809.3
3rd Qu. 28.00     183.0   79.00  24.20    9.315e+07  1.340e+12    26401.8
Max.    71.00     226.0  214.00  63.90    1.380e+09  1.870e+13   178846.0

The factor columns are tabulated by count rather than by quantile: for example Sex (F: 46954, M: 87011), Season (Summer: 105729, Winter: 28236), Medal.No.Yes (0: 115990, 1: 17975), the most common NOC (USA, 12218 entries), and the most common Sport (Athletics, 18072 entries).

4 Olympics Correlation plot

Quickly visualizing the correlations will be useful for model building, but we have to be mindful that columns such as Medal and Medal.No.Yes are naturally going to be highly correlated.

It might be more useful to focus in on the correlations for only the variable Medal.No.Yes and Medal:

rowname Medal.No.Yes
GDP 0.1414144
GDPpC 0.0884261
Height 0.0831910
Weight 0.0771523
Population 0.0756385
BMI.Category 0.0722161
NOC 0.0663156
Team 0.0620423
Event 0.0551700
Sport 0.0550994
Year 0.0520386
Decade 0.0517238
Games 0.0513962
BMI 0.0418939
Age 0.0314189
ID 0.0105085
First.Name 0.0044511
Name 0.0042080
Last.Name -0.0067437
City -0.0180115
Sex -0.0305938
Season -0.0420744
Medal -0.4586621
rowname Medal
Season 0.0237157
City 0.0108140
Sex 0.0083743
Last.Name 0.0032221
Name -0.0017224
First.Name -0.0017838
ID -0.0057542
Age -0.0150506
Games -0.0204830
Decade -0.0208330
Year -0.0209759
Team -0.0211824
BMI -0.0230238
Event -0.0268180
Sport -0.0271357
Population -0.0276393
NOC -0.0294776
BMI.Category -0.0364803
GDPpC -0.0372752
Weight -0.0380738
Height -0.0381812
GDP -0.0600564
Medal.No.Yes -0.4586621

Based on the strength of these correlations, building a general model on the variable Medal.No.Yes looks like the better idea, unless the athletes that didn’t receive a medal are excluded.

5 Clustering, Kmeans, Kmedoids

My first thought was to do some clustering with just the numeric columns originally present in the data, namely Age, Weight, and Height, so I decided to look at some 3D scatter plots.

So we are indeed seeing different behavior with these two sports: Triathlon appears more spread out, while Softball appears mostly clustered around lower ages. But first things first: before we go looking for clusters, we should calculate the Hopkins statistic, which tests whether the data has a clustering tendency at all. We conduct the Hopkins statistic test using 0.5 as the threshold. That is, if H < 0.5, then it is unlikely that D has statistically significant clusters. Put another way, if the value of the Hopkins statistic is close to 1, then we can reject the null hypothesis of spatial randomness and conclude that the dataset D is significantly clusterable. We need to make sure to remove NAs and scale the variables to make them comparable; scaling transforms each variable so that it has mean zero and standard deviation one.
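The statistic itself is simple to compute. Here is a self-contained Python sketch (the report's own values below come from an R implementation; the function name and sampling details here are ours), using the convention where H near 0.5 means spatially uniform data and H near 1 means clustered data:

```python
import numpy as np

def hopkins_statistic(X, m=None, rng=None):
    """Hopkins statistic: ~0.5 for spatially uniform data, near 1 for clustered data."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = m or min(50, n // 10)
    # m points drawn uniformly inside the data's bounding box
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    # m points sampled from the data itself, without replacement
    idx = rng.choice(n, size=m, replace=False)
    # u_i: distance from each uniform point to its nearest data point
    u = np.sqrt(((U[:, None, :] - X[None, :, :]) ** 2).sum(-1)).min(axis=1)
    # w_i: distance from each sampled point to its nearest *other* data point
    dw = np.sqrt(((X[idx][:, None, :] - X[None, :, :]) ** 2).sum(-1))
    dw[np.arange(m), idx] = np.inf  # exclude each point's zero self-distance
    w = dw.min(axis=1)
    return u.sum() / (u.sum() + w.sum())

# Scale first so that Age, Height, and Weight are comparable (mean 0, sd 1):
# Z = (X - X.mean(axis=0)) / X.std(axis=0)
```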

[1] "Triathlon without Population and GDP"
Hopkins Statistic H = 0.7551693 
[1] "Triathlon with Population and GDP"
Hopkins Statistic H = 0.8262848 
[1] "Softball without Population and GDP"
Hopkins Statistic H = 0.7773288 
[1] "Softball with Population and GDP"
Hopkins Statistic H = 0.7967403 

Clearly all of these values are greater than 0.5, so there are statistically significant clusters present. Of course, Kmeans (and Kmedoids) requires us to specify the number of clusters \(k\) up front. We can use the elbow method, the silhouette method, and the gap statistic to get an idea of how many clusters \(k\) we should specify.
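The elbow heuristic behind these choices is easy to sketch: run k-means for a range of \(k\) and look for where the total within-cluster sum of squares (WSS) stops dropping sharply. A minimal NumPy version (the report used R; the data here is hypothetical, standing in for a scaled athlete matrix):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns labels and total within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    wss = ((X - centers[labels]) ** 2).sum()
    return labels, wss

# Hypothetical stand-in data: two well-separated groups in 3 dimensions.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-3, 0.5, (60, 3)), rng.normal(3, 0.5, (60, 3))])
wss_by_k = {k: kmeans(X, k)[1] for k in range(1, 6)}
# WSS always decreases as k grows; the "elbow" is where the drop levels off (k=2 here).
```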

For the Triathlon data with only variables Age, Height, and Weight, for Kmeans and Kmedoids, we’ll try \(k\)=2, and for the Triathlon data that additionally has Population and GDP, we’ll try \(k\)=3 and \(k\)=7.

This is interesting for the Softball data. Though according to the Hopkins statistic there were statistically significant clusters, \(k\)=1 keeps being suggested, so there are likely to be many overlapping data points in the clusters when \(k\) ≥ 2. I will try \(k\)=2 for Softball with only Age, Weight, and Height, and, with Population and GDP added, \(k\)=3 for Kmedoids and \(k\)=6 for Kmeans.

Dim.1 Dim.2 Dim.3
Age 0.2739968 99.3952316 0.3307716
Height 50.0136554 0.0013354 49.9850093
Weight 49.7123478 0.6034330 49.6842192

# A tibble: 2 x 4
  Cluster   Age Height Weight
    <int> <dbl>  <dbl>  <dbl>
1       1  28.2   166.   54.2
2       2  27.5   180.   68.4
# A tibble: 2 x 4
  Cluster   Age Height Weight
    <int> <dbl>  <dbl>  <dbl>
1       1  27.4   180.   68.1
2       2  28.4   166.   53.8

I will only print the cluster means for the clustering with the most distinct clusters, which is \(k\)=3 using Kmeans.

Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Age 0.2738287 0.3359204 89.3319869 9.5647119 0.4935521
Height 49.5850049 0.5009423 0.0013683 0.1849146 49.7277699
Weight 49.4872272 0.2153672 0.3818564 0.6547254 49.2608238
Population 0.2345734 50.8194590 3.1923196 45.3527555 0.4008925
GDP 0.4193659 48.1283110 7.0924688 44.2428925 0.1169618

# A tibble: 3 x 6
  Cluster   Age Height Weight Population     GDP
    <int> <dbl>  <dbl>  <dbl>      <dbl>   <dbl>
1       1  28.3   175.   63.7 590769231. 1.19e13
2       2  27.4   180.   68.2  52221408. 1.11e12
3       3  28.1   166.   53.7  59387371. 1.56e12

We use Population and GDP here as a proxy for Team, since Team is not a continuous variable. Yes, we could convert a variable such as Team to numeric codes and use it that way, but what would a cluster with a mean Team of 2.5 mean? How is an athlete “in between” two teams? Using Population and GDP lets us get a sense of the Teams while still working with continuous variables. For example, for cluster 1, with a mean GDP of about 1.2e13, the only countries with a GDP that high are China and the US, so there is a distinct cluster in the Triathlon data made up of American and Chinese athletes. In fact, all of the cluster mean GDP values are relatively high and correspond to countries like Australia, Canada, etc., which tells us that these clusters are all made up of athletes from rather rich countries.

Moving on to softball:

Dim.1 Dim.2 Dim.3
Age 1.59669 98.0171460 0.3861638
Height 49.58696 0.2061233 50.2069183
Weight 48.81635 1.7767307 49.4069180

# A tibble: 2 x 4
  Cluster   Age Height Weight
    <int> <dbl>  <dbl>  <dbl>
1       1  25.8   167.   64.3
2       2  28.4   177.   77.3
# A tibble: 2 x 4
  Cluster   Age Height Weight
    <int> <dbl>  <dbl>  <dbl>
1       1  28.4   175.   74.6
2       2  25.1   166.   63.2

Using \(k\)=2 wasn’t actually too bad! And Kmeans and Kmedoids appear to have recovered very similar centers. Moving on to the Softball data with Population and GDP: technically \(k\)=3 was for Kmedoids and \(k\)=6 was for Kmeans, but I will try both cluster sizes for both methods.

Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Age 0.4196026 44.1563110 35.479383 19.616889 0.327814
Height 41.5173922 0.9554527 1.662170 10.720112 45.144873
Weight 43.2316997 1.9327517 2.825011 2.867047 49.143491
Population 0.8178199 38.7382345 55.625875 1.149520 3.668550
GDP 14.0134856 14.2172501 4.407561 65.646431 1.715272

As we can see, the clusters are all on top of each other. \(k\)=3 seems to give roughly the same clusters using Kmeans or Kmedoids, but \(k\)=6 appears to have given very different results depending on the method. I will not print out the means/medoids here, since the clusters are on top of each other and not as distinct. We could investigate what these clusters look like when plotted for the different principal axes:

but the clusters are still on top of each other. It appears for some sports adding Population and GDP can help define new clusters, but for other sports, it muddies them. It is possible other clustering methods may reveal clusters not found using Kmeans and Kmedoids.

6 Linear and Logit Regression

6.1 Linear Model

We added a new column, TMedals, to the Olympic dataset to hold the total number of medals earned in each event year. The purpose of this new numeric column is to allow us to build linear models.

Based on the correlation plot, GDP, Population, Height, and Sport.Int are the features with the highest correlation. Four models will be built, adding one feature at a time to find the best-fitted model. We can ignore Year and Decade showing high correlation with the total number of medals, because the TMedals field was created based on the Olympic year.


Call:
lm(formula = dyn(TMedals ~ GDP + Population + Height + Sport.Int), 
    data = new_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2242.2 -1094.1   219.7   918.3  1512.3 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.128e+03  1.205e+02   9.364  < 2e-16 ***
GDP         4.436e-11  1.855e-12  23.915  < 2e-16 ***
Population  2.313e-07  2.808e-08   8.237  < 2e-16 ***
Height      5.518e+00  6.742e-01   8.184 2.92e-16 ***
Sport.Int   3.495e+00  4.973e-01   7.028 2.17e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 995.5 on 17970 degrees of freedom
Multiple R-squared:  0.05487,   Adjusted R-squared:  0.05466 
F-statistic: 260.8 on 4 and 17970 DF,  p-value: < 2.2e-16
Table: Adjusted \(R^2\) for each model.

                                                                 Adjusted R2
Linear Model 1: TMedals ~ GDP                                      0.0454538
Linear Model 2: TMedals ~ GDP + Population                         0.0484146
Linear Model 3: TMedals ~ GDP + Population + Height                0.0521113
Linear Model 4: TMedals ~ GDP + Population + Height + Sport.Int    0.0546568

Table: VIFs of the model.

                  VIF
GDP            1.1481
Population     1.1740
Height         1.0270
Sport.Int      1.0099

According to the Adjusted \(R^2\) value, model 4 is the best fit. The coefficients’ p-values for (Intercept), GDP, Population, Height, and Sport are less than significance level \(\alpha\) = 0.05, making them statistically significant. The VIF values for GDP, Population, Height and Sport are greater than 1 but less than 5 so multicollinearity is not an issue with this model.
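The VIF for a predictor is \(1/(1-R^2_j)\), where \(R^2_j\) comes from regressing that predictor on all the other predictors. A NumPy sketch on hypothetical data (the report's values come from an R VIF routine; the data and function names here are illustrative):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2 of column ~ others)."""
    return np.array([1.0 / (1.0 - r_squared(np.delete(X, j, axis=1), X[:, j]))
                     for j in range(X.shape[1])])

# Hypothetical predictors: x2 is nearly independent, x3 is strongly collinear with x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([x1, x2, x3]))
# vifs[1] stays near 1 (no collinearity); vifs[0] and vifs[2] are heavily inflated.
```

VIF values near 1, like those in the table above, indicate that the predictors carry largely independent information.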

Analysis of Variance Table

Model 1: TMedals ~ GDP
Model 2: TMedals ~ GDP + Population
Model 3: TMedals ~ GDP + Population + Height
Model 4: TMedals ~ GDP + Population + Height + Sport.Int
  Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
1  17973 1.7985e+10                                  
2  17972 1.7928e+10  1  56784414 57.299 3.925e-14 ***
3  17971 1.7858e+10  1  70640998 71.281 < 2.2e-16 ***
4  17970 1.7809e+10  1  48946153 49.389 2.174e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
     Res.Df           RSS                  Df      Sum of Sq       
 Min.   :17970   Min.   :1.781e+10   Min.   :1   Min.   :48946153  
 1st Qu.:17971   1st Qu.:1.785e+10   1st Qu.:1   1st Qu.:52865283  
 Median :17972   Median :1.789e+10   Median :1   Median :56784414  
 Mean   :17972   Mean   :1.789e+10   Mean   :1   Mean   :58790521  
 3rd Qu.:17972   3rd Qu.:1.794e+10   3rd Qu.:1   3rd Qu.:63712706  
 Max.   :17973   Max.   :1.799e+10   Max.   :1   Max.   :70640998  
                                     NA's   :1   NA's   :1         
       F             Pr(>F) 
 Min.   :49.39   Min.   :0  
 1st Qu.:53.34   1st Qu.:0  
 Median :57.30   Median :0  
 Mean   :59.32   Mean   :0  
 3rd Qu.:64.29   3rd Qu.:0  
 Max.   :71.28   Max.   :0  
 NA's   :1       NA's   :1  

An ANOVA test was used to compare the four models. The p-values are less than the standard \(\alpha\) = 0.05, so the models differ significantly; each added feature explains enough variance to reject the null hypothesis that the models are the same.
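The F statistics in the ANOVA table follow directly from the residual sums of squares of the nested models. A quick Python check against the Model 3 to Model 4 row above (the helper function is ours):

```python
def nested_f(rss_reduced, rss_full, df_reduced, df_full):
    """F statistic for comparing nested OLS models:
    ((RSS_reduced - RSS_full) / (df_reduced - df_full)) / (RSS_full / df_full)."""
    return ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)

# Model 3 -> Model 4 row: Sum of Sq 48946153 on 1 df, against RSS 1.7809e10 on 17970 df
f = nested_f(1.7809e10 + 48946153, 1.7809e10, 17971, 17970)
print(round(f, 1))  # 49.4, matching the reported F of 49.389
```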

6.2 Logit Regression: Factors influencing Earning Olympic Medals

We used logit regression to model which factors influence the chances of an athlete receiving a medal (the Medal.No.Yes variable).


Call:
glm(formula = Medal.No.Yes ~ Sex.Int + Height + Sport.Int, family = binomial(link = "logit"), 
    data = ol_dt_subset)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0290  -0.5688  -0.5000  -0.4327   2.6067  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.3041122  0.1427563  -51.16   <2e-16 ***
Sex.Int     -0.5789713  0.0196270  -29.50   <2e-16 ***
Height       0.0349904  0.0008865   39.47   <2e-16 ***
Sport.Int    0.0094889  0.0005236   18.12   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 105631  on 133964  degrees of freedom
Residual deviance: 103524  on 133961  degrees of freedom
AIC: 103532

Number of Fisher Scoring iterations: 4
 (Intercept)      Sex.Int       Height    Sport.Int 
0.0006727665 0.5604746546 1.0356098076 1.0095340681 
                   2.5 %      97.5 %
(Intercept) -7.583909308 -7.02431508
Sex.Int     -0.617439435 -0.54050308
Height       0.033252845  0.03672803
Sport.Int    0.008462602  0.01051521

fitting null model for pseudo-r2

The McFadden value (one of the pseudo-\(R^2\) statistics) of 0.02 also shows that this is not a particularly good model, with only about 2% of the variation explained.
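For a binary 0/1 outcome the deviance equals \(-2\) times the model log-likelihood, so McFadden's pseudo-\(R^2 = 1 - \ell_{model}/\ell_{null}\) can be read straight off the glm output. Checking against the deviances reported above:

```python
def mcfadden(null_deviance, residual_deviance):
    """McFadden pseudo-R^2: 1 - ll_model / ll_null. For 0/1 outcomes the deviance
    is -2 * log-likelihood, so the ratio of deviances gives the same value."""
    return 1 - residual_deviance / null_deviance

# Deviances from the glm summary above (null: 105631, residual: 103524)
print(round(mcfadden(105631, 103524), 2))  # 0.02, as reported
```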

6.3 Logistic Regression: Medal.No.Yes variable

Now we have already looked at the correlation matrix for the entire dataset, and it looks like Medal.No.Yes might have some stronger correlations than Medal. We’ll be attempting to use the entire dataset for logistic regression to see what variables go into predicting whether an athlete is awarded a medal or not.


Call:
glm(formula = Medal.No.Yes ~ GDP + Height + Weight + Population + 
    Sport, family = binomial(link = "logit"), data = olympic.data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4937  -0.5543  -0.4351  -0.3488   2.6108  

Coefficients:
                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -4.560e+00  2.189e-01 -20.830  < 2e-16 ***
GDP                             8.836e-14  2.282e-15  38.719  < 2e-16 ***
Height                          8.280e-03  1.527e-03   5.421 5.94e-08 ***
Weight                          1.373e-03  1.068e-03   1.285 0.198737    
Population                      3.169e-10  3.116e-11  10.169  < 2e-16 ***
SportArchery                    8.303e-01  1.052e-01   7.894 2.93e-15 ***
SportAthletics                  5.222e-01  7.130e-02   7.324 2.40e-13 ***
SportBadminton                  6.829e-01  1.248e-01   5.471 4.46e-08 ***
SportBaseball                   2.654e+00  1.052e-01  25.215  < 2e-16 ***
SportBasketball                 1.608e+00  8.274e-02  19.431  < 2e-16 ***
SportBeach Volleyball           7.545e-01  1.622e-01   4.652 3.28e-06 ***
SportBiathlon                   2.011e-01  9.511e-02   2.114 0.034499 *  
SportBobsleigh                  1.173e-01  1.242e-01   0.945 0.344823    
SportBoxing                     1.308e+00  8.421e-02  15.529  < 2e-16 ***
SportCanoeing                   1.012e+00  8.146e-02  12.418  < 2e-16 ***
 [ reached getOption("max.print") -- omitted 40 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 105748  on 134304  degrees of freedom
Residual deviance:  98409  on 134250  degrees of freedom
  (17672 observations deleted due to missingness)
AIC: 98519

Number of Fisher Scoring iterations: 5
         Predicted 0 Predicted 1  Total
Actual 0      115863         462 116325
Actual 1       17429         551  17980
Total         133292        1013 134305
fitting null model for pseudo-r2
  McFadden 
0.06939646 

Area under the curve: 0.6884

Team was not included because it was not significant; the overall model, however, is significant. The area under the curve (0.6884) is below 0.8, so this is not a good model. The true negative percentage was 99.6028369% and the true positive percentage was 3.0645161%, so the model is mostly labelling everything as not receiving a medal.
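The true negative and true positive percentages quoted here come straight from the confusion matrix above. A sketch of the calculation (the helper function is ours):

```python
def rates(tn, fp, fn, tp):
    """True negative rate (specificity) and true positive rate (sensitivity), in percent."""
    return 100 * tn / (tn + fp), 100 * tp / (tp + fn)

# Confusion matrix above: Actual 0 -> (115863, 462), Actual 1 -> (17429, 551)
tnr, tpr = rates(tn=115863, fp=462, fn=17429, tp=551)
print(round(tnr, 2), round(tpr, 2))  # 99.6 3.06
```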

Let’s try doing a model for Basketball, one of the more popular sports by athlete count. I’m going to include Age, Weight, and Height at first, but something interesting happens:


Call:
glm(formula = Medal.No.Yes ~ Age + Weight + Height + GDP + Population + 
    Team, family = binomial(link = "logit"), data = basket1)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.40839  -0.61712  -0.00009   0.00001   2.63631  

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   3.709e+00  1.865e+00   1.989 0.046732 *  
Age                          -1.421e-02  1.668e-02  -0.852 0.394215    
Weight                       -1.961e-02  1.006e-02  -1.950 0.051226 .  
Height                       -4.557e-03  1.280e-02  -0.356 0.721886    
GDP                           5.767e-13  1.057e-13   5.458 4.83e-08 ***
Population                   -2.642e-08  3.426e-09  -7.710 1.26e-14 ***
TeamAustralia                -1.747e+00  3.325e-01  -5.254 1.49e-07 ***
TeamBelarus                  -2.039e+01  2.285e+03  -0.009 0.992883    
TeamBrazil                    1.347e+00  4.547e-01   2.964 0.003041 ** 
TeamCanada                   -2.016e+01  9.298e+02  -0.022 0.982702    
TeamCentral African Republic -2.033e+01  3.094e+03  -0.007 0.994757    
TeamChina                     2.678e+01  3.628e+00   7.383 1.55e-13 ***
TeamCongo (Kinshasa)         -1.955e+01  3.238e+03  -0.006 0.995183    
TeamCuba                     -2.445e+00  4.357e-01  -5.612 2.00e-08 ***
TeamCzech Republic           -2.046e+01  1.787e+03  -0.011 0.990866    
 [ reached getOption("max.print") -- omitted 27 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2840.1  on 2451  degrees of freedom
Residual deviance: 1416.6  on 2410  degrees of freedom
  (216 observations deleted due to missingness)
AIC: 1500.6

Number of Fisher Scoring iterations: 18
fitting null model for pseudo-r2
 McFadden 
0.5012314 
         Predicted 0 Predicted 1 Total
Actual 0        1773          27  1800
Actual 1         304         348   652
Total           2077         375  2452

Area under the curve: 0.9089

Looking at this, we see that the variables Age, Height, and Weight are not significant, but GDP, Population, and Team are! Let’s go ahead and see what the model looks like without these variables…


Call:
glm(formula = Medal.No.Yes ~ GDP + Population + Team, family = binomial(link = "logit"), 
    data = basket1)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.32386  -0.64567  -0.00008   0.00000   2.40026  

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   4.522e-01  2.831e-01   1.597  0.11016    
GDP                           5.880e-13  1.038e-13   5.667 1.45e-08 ***
Population                   -2.716e-08  3.344e-09  -8.120 4.65e-16 ***
TeamAustralia                -1.452e+00  3.164e-01  -4.590 4.42e-06 ***
TeamBelarus                  -1.979e+01  2.293e+03  -0.009  0.99311    
TeamBrazil                    1.802e+00  4.387e-01   4.109 3.98e-05 ***
TeamCanada                   -1.966e+01  9.368e+02  -0.021  0.98325    
TeamCentral African Republic -1.995e+01  3.104e+03  -0.006  0.99487    
TeamChina                     2.798e+01  3.542e+00   7.900 2.79e-15 ***
TeamCongo (Kinshasa)         -1.886e+01  3.104e+03  -0.006  0.99515    
TeamCuba                     -1.980e+00  4.146e-01  -4.776 1.79e-06 ***
TeamCzech Republic           -1.985e+01  1.792e+03  -0.011  0.99116    
TeamEgypt                    -1.883e+01  1.618e+03  -0.012  0.99072    
TeamFinland                  -1.990e+01  3.104e+03  -0.006  0.99489    
TeamFrance                   -9.292e-01  3.708e-01  -2.506  0.01221 *  
 [ reached getOption("max.print") -- omitted 24 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2899.5  on 2535  degrees of freedom
Residual deviance: 1468.0  on 2497  degrees of freedom
  (132 observations deleted due to missingness)
AIC: 1546

Number of Fisher Scoring iterations: 18
fitting null model for pseudo-r2
 McFadden 
0.4937136 
         Predicted 0 Predicted 1 Total
Actual 0        1857          23  1880
Actual 1         297         359   656
Total           2154         382  2536

Area under the curve: 0.9065

The area under the curve is hardly affected: the difference is only 0.0023698! Moreover, the model with just GDP, Population, and Team is significant overall, and its area under the curve exceeds 0.8; in fact, it is only 0.0934954 away from 1, so this appears to be a very good model. The true negative percentage was 98.7765957% and the true positive percentage was 54.7256098%, so this model is much better at predicting when an athlete will receive a medal. Of course, we would like the true positive rate to be higher, but it is interesting that Basketball is a sport that can be modelled by variables related to the country the athletes come from, rather than characteristics describing the athletes themselves…
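The AUC values being compared here have a useful rank interpretation: the AUC is the probability that a randomly chosen medalist receives a higher predicted probability than a randomly chosen non-medalist. A small Python sketch with toy scores (the report's values come from pROC in R):

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: the fraction of (positive, negative)
    pairs in which the positive case gets the higher score (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```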

Softball (and Baseball) were removed from the Olympics because the USA and Japan dominated the sport there. Let’s see what variables affect the logistic regression model for Softball.


Call:
glm(formula = Medal.No.Yes ~ Age + Weight + Height + GDP + Population + 
    Team, family = binomial(link = "logit"), data = soft)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.83395  -0.00004  -0.00001   0.00003   1.32990  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)   
(Intercept)        1.628e+01  3.507e+03   0.005  0.99630   
Age                1.208e-01  8.699e-02   1.389  0.16482   
Weight             1.690e-02  4.542e-02   0.372  0.70988   
Height             2.172e-02  6.557e-02   0.331  0.74043   
GDP                5.809e-12  2.946e-12   1.972  0.04864 * 
Population        -3.061e-07  1.151e-07  -2.658  0.00785 **
TeamCanada        -4.264e+01  4.561e+03  -0.009  0.99254   
TeamChina          3.485e+02  3.510e+03   0.099  0.92091   
TeamCuba          -4.304e+01  8.169e+03  -0.005  0.99580   
TeamItaly         -3.772e+01  5.796e+03  -0.007  0.99481   
TeamJapan         -1.238e+01  3.507e+03  -0.004  0.99718   
TeamNew Zealand   -4.548e+01  8.087e+03  -0.006  0.99551   
TeamUnited States  2.988e+01  4.195e+03   0.007  0.99432   
TeamVenezuela     -3.924e+01  8.033e+03  -0.005  0.99610   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 507.328  on 366  degrees of freedom
Residual deviance:  66.672  on 353  degrees of freedom
  (7 observations deleted due to missingness)
AIC: 94.672

Number of Fisher Scoring iterations: 20
fitting null model for pseudo-r2
 McFadden 
0.8685814 
         Predicted 0 Predicted 1 Total
Actual 0         180          15   195
Actual 1           3         169   172
Total            183         184   367

Area under the curve: 0.993

Look at these results! Age, Height, and Weight have no significance, while GDP is significant at the * level and Population at the ** level. The area under the curve is 0.9929636, the true negative percentage was 92.3076923%, and the true positive percentage was 98.255814%.

Let’s only keep Population and Team and see what happens…


Call:
glm(formula = Medal.No.Yes ~ Population + Team, family = binomial(link = "logit"), 
    data = soft)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.75107  -0.00003  -0.00002   0.00003   0.80789  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        2.491e+01  3.768e+03   0.007 0.994726    
Population        -1.689e-07  5.047e-08  -3.346 0.000819 ***
TeamCanada        -4.120e+01  5.327e+03  -0.008 0.993829    
TeamChina          1.845e+02  3.769e+03   0.049 0.960953    
TeamCuba          -4.460e+01  8.436e+03  -0.005 0.995782    
TeamItaly         -3.680e+01  6.533e+03  -0.006 0.995506    
TeamJapan         -2.340e+00  3.768e+03  -0.001 0.999505    
TeamNew Zealand   -4.582e+01  8.436e+03  -0.005 0.995666    
TeamUnited States  4.655e+01  5.049e+03   0.009 0.992644    
TeamVenezuela     -4.181e+01  8.436e+03  -0.005 0.996046    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 517.79  on 373  degrees of freedom
Residual deviance:  75.40  on 364  degrees of freedom
AIC: 95.4

Number of Fisher Scoring iterations: 20
fitting null model for pseudo-r2
 McFadden 
0.8543811 
         Predicted 0 Predicted 1 Total
Actual 0         180          15   195
Actual 1           0         179   179
Total            180         194   374

Area under the curve: 0.9811

The area under the curve is 0.9811, the true negative percentage was 92.3076923%, and the true positive percentage was 100%. Softball (and Baseball) were removed from the Olympics because the USA and Japan dominated the sport, and looking at these models it is clear that you can predict the chance of being awarded a medal based solely on which team an athlete comes from. Of course, removing Softball from the Olympics had far-reaching negative impacts that are still felt by the sport years later… But looking at these logistic regression models, it makes sense why there was felt to be a need to remove Softball and Baseball from the Olympics.

For some sports, however, Age and Height do play a role, such as in Swimming, where Weight, GDP, and Population were not statistically significant:


Call:
glm(formula = Medal.No.Yes ~ Age + Weight + Height + GDP + Population + 
    Team, family = binomial(link = "logit"), data = swim)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8476  -0.4431  -0.2756  -0.0002   3.2767  

Coefficients:
                                     Estimate Std. Error z value Pr(>|z|)   
(Intercept)                        -2.337e+01  2.289e+03  -0.010  0.99185   
Age                                 3.145e-02  9.750e-03   3.225  0.00126 **
Weight                              4.687e-03  6.035e-03   0.777  0.43744   
Height                              2.121e-02  7.288e-03   2.911  0.00361 **
GDP                                -7.582e-15  1.597e-14  -0.475  0.63501   
Population                         -9.145e-10  1.488e-09  -0.615  0.53873   
TeamAndorra                         1.976e-01  3.152e+03   0.000  0.99995   
TeamArgentina                       1.351e+01  2.289e+03   0.006  0.99529   
TeamArmenia                        -1.016e-01  3.347e+03   0.000  0.99998   
TeamAustralia                       1.786e+01  2.289e+03   0.008  0.99377   
TeamAustria                         1.461e+01  2.289e+03   0.006  0.99491   
TeamAzerbaijan                      1.938e-01  3.147e+03   0.000  0.99995   
TeamBahrain                         3.710e-01  3.247e+03   0.000  0.99991   
TeamBelarus                         1.527e+01  2.289e+03   0.007  0.99468   
TeamBelgium                         1.448e+01  2.289e+03   0.006  0.99495   
 [ reached getOption("max.print") -- omitted 96 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10328  on 12766  degrees of freedom
Residual deviance:  6736  on 12656  degrees of freedom
  (1327 observations deleted due to missingness)
AIC: 6958

Number of Fisher Scoring iterations: 17
fitting null model for pseudo-r2
McFadden 
0.347799 
         Predicted 0 Predicted 1 Total
Actual 0       10605         378 10983
Actual 1         910         874  1784
Total          11515        1252 12767

Area under the curve: 0.8801

So for different sports, different factors go into modeling whether an athlete is awarded a medal. Some sports have a “human element” not captured by the data, while for others this dataset has all the information needed to accurately predict whether an athlete will receive a medal.
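The McFadden pseudo-R² values printed in the model summaries above can be recovered from the reported deviances: for a binomial GLM, McFadden's measure equals one minus the ratio of residual to null deviance. A quick Python check (the analysis itself was done in R):

```python
def mcfadden(residual_deviance, null_deviance):
    # McFadden pseudo-R^2 = 1 - logLik(model)/logLik(null);
    # for a binomial GLM this equals 1 - residual/null deviance
    return 1 - residual_deviance / null_deviance

softball = mcfadden(75.40, 517.79)   # ~0.854, matching the Softball output
swimming = mcfadden(6736, 10328)     # ~0.348, matching the Swimming output
```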

7 Predicting Medal Winners Based on Athlete’s Information and Sport

In this section, classification models are built using kNN and Random Forest to predict whether an athlete with certain characteristics will win a gold, silver, or bronze medal at the Olympics.

The first step is to drop columns that may not be useful. This is done based on domain knowledge as well as recognizing that some factors are collinear. For example, the Gross Domestic Product per capita (GDPpC) of a country has a direct relationship to its GDP and Population. Another example is Weight, Height, and BMI: the former two are inputs to BMI’s calculation.
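The Weight/Height/BMI relationship mentioned above is deterministic, which is why keeping all three would introduce collinearity; a small sketch with illustrative values:

```python
def bmi(weight_kg, height_cm):
    # BMI = weight (kg) / height (m)^2, so BMI is fully
    # determined by Weight and Height
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

print(round(bmi(70, 175), 2))  # 22.86 for an illustrative 70 kg, 175 cm athlete
```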

NA values are all dropped. Linearly interpolating missing GDPpC values would assume the series is linear in nature; however, as shown in the figure below, it has a cyclical characteristic.

GDP growth (annual %) - Afghanistan, Indonesia, Jordan, Russian Federation, Mexico


Furthermore, estimating missing Age, Height, and Weight values for athletes would result in inaccurate data being fed into the models.

NOC Year Decade Sex Age BMI BMI.Category GDPpC Season Sport Medal Medal.No.Yes Sex.Int NOC.Int Sport.Int
AFG 2008 2000s M 21 18.81215 2 364.6605 Summer Taekwondo Bronze 1 2 1 44
AFG 2012 2010s M 25 18.81215 2 641.8722 Summer Taekwondo Bronze 1 2 1 44
ARG 1964 1960s M 34 22.49135 2 1173.2382 Summer Equestrianism Silver 1 2 4 16
ARG 1968 1960s M 22 24.91077 2 1141.0806 Summer Boxing Bronze 1 2 4 10
ARG 1968 1960s M 24 25.99244 3 1141.0806 Summer Rowing Bronze 1 2 4 31
ARG 1972 1970s M 28 25.99244 3 1408.8652 Summer Rowing Silver 1 2 4 31

7.1 Random Forest

The data is split into training and test sets at a 70%/30% ratio.

[1] 0.6999722
[1] 12582
[1] 5393
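The three numbers printed above are consistent: the first is the realised training fraction implied by the 12582 training rows and 5393 test rows. A quick check in Python:

```python
n_train, n_test = 12582, 5393          # set sizes printed above
ratio = n_train / (n_train + n_test)   # realised training fraction
print(round(ratio, 7))                 # 0.6999722
```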

Below, the feature selection method in the randomForest library is used to pick the features that produce the highest accuracy, using cross-validation and backward feature selection.

Accuracy vs Number of Variables


The results above indicate the features that give the best accuracy results are GDPpC, Sex, Decade and Sport.


Call:
 randomForest(formula = Medal ~ GDPpC + Sex + Decade + Sport,      data = data_train1, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 32.61%
Confusion matrix:
       Bronze Gold Silver class.error
Bronze   2838  734    647   0.3273288
Gold      619 3007    552   0.2802776
Silver    785  766   2634   0.3706093

Initial model without any tuning gives the results below:

      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
  8.208552e-01   7.312837e-01   8.140410e-01   8.275207e-01   3.353203e-01 
AccuracyPValue  McnemarPValue 
  0.000000e+00   9.850543e-23 
              Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
Class: Bronze   0.8343209   0.9128303      0.8284302      0.9161166 0.8284302
Class: Gold     0.8528004   0.8873156      0.7900222      0.9238107 0.7900222
Class: Silver   0.7753883   0.9311659      0.8488098      0.8926818 0.8488098
                 Recall        F1 Prevalence Detection Rate
Class: Bronze 0.8343209 0.8313651  0.3353203      0.2797647
Class: Gold   0.8528004 0.8202118  0.3320617      0.2831823
Class: Silver 0.7753883 0.8104396  0.3326180      0.2579081
              Detection Prevalence Balanced Accuracy
Class: Bronze            0.3377047         0.8735756
Class: Gold              0.3584486         0.8700580
Class: Silver            0.3038468         0.8532771
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
  6.690154e-01   5.031396e-01   6.562748e-01   6.815729e-01   3.517523e-01 
AccuracyPValue  McnemarPValue 
  0.000000e+00   2.131689e-05 
              Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
Class: Bronze   0.6700053   0.8292334      0.6804069      0.8224113 0.6804069
Class: Gold     0.7197958   0.8228650      0.6637029      0.8580868 0.6637029
Class: Silver   0.6162724   0.8510929      0.6621203      0.8240741 0.6621203
                 Recall        F1 Prevalence Detection Rate
Class: Bronze 0.6700053 0.6751660  0.3517523      0.2356759
Class: Gold   0.7197958 0.6906122  0.3269052      0.2353050
Class: Silver 0.6162724 0.6383742  0.3213425      0.1980345
              Detection Prevalence Balanced Accuracy
Class: Bronze            0.3463749         0.7496193
Class: Gold              0.3545337         0.7713304
Class: Silver            0.2990914         0.7336826
          Reference
Prediction Bronze Gold Silver
    Bronze   1271  268    329
    Gold      307 1269    336
    Silver    319  226   1068

Model Parameter Tuning:

The default number of trees built in the randomForest model is ntree = 500. The code below increases the number of trees three times, in increments of 250, to see if it improves the confusion matrix metrics.

Random forest ntree = 500, 750, 1000 and 1250. Accuracy (red), Sensitivity (green), Specificity (blue) and Precision (black)


ntree = 1000 gave the best results for all metrics except sensitivity, which decreased by approximately 0.1%.

Next, the maximum number of nodes is altered and models are compared with ntree = 1000. The maxnodes parameter was varied over 2500, 3000, 4000, and 5000.

Random forest maxnodes = 2500, 3000, 4000 and 5000. Accuracy (red), Sensitivity (green), Specificity (blue) and Precision (black)


Given the small accuracy changes from the tuning attempts above, the final model is kept with the default parameters as before.

AUC = 0.75, an acceptable result.

Next a KNN model will be built and compared with the final Random Forest model above.

7.2 KNN

In order for the kNN package to give the best results, the data is first scaled and all categorical columns are mapped to 1s and 0s. The same seed number and split percentages are used so that the model can be compared with the Random Forest model.

[1] 0.6999722
[1] 12582
[1] 5393

Plotting the accuracy of kNN with default parameters and varying k between 1 and 31 gives the results below for the same features used:

Accuracy vs k- kNN model


From the plot above, k = 5 gives the best accuracy.
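The classifier itself is conceptually simple: find the k nearest training points in the scaled feature space and take a majority vote. A minimal Python sketch of the idea (the report uses an R kNN package; the names here are illustrative):

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    # sort training points by squared Euclidean distance to x
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    # majority vote among the k nearest neighbours
    votes = [train_y[i] for i in order[:k]]
    return Counter(votes).most_common(1)[0][0]

# toy example: two well-separated groups on one scaled feature
X = [[0.0], [1.0], [10.0], [11.0]]
y = ["Bronze", "Bronze", "Gold", "Gold"]
print(knn_predict(X, y, [0.5], k=3))  # Bronze
```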

Validation results using the training data are shown below:

olympics_5NN1
Bronze   Gold Silver 
  4709   4369   3504 

Validation results using the test data are shown below:

olympics_5NN2
Bronze   Gold Silver 
  1995   1897   1501 

Confusion matrices for the training and test validations are shown below:

          Reference
Prediction Bronze Gold Silver
    Bronze   3393  595    721
    Gold      491 3225    653
    Silver    335  358   2811
          Reference
Prediction Bronze Gold Silver
    Bronze   1284  329    382
    Gold      353 1219    325
    Silver    260  215   1026

Below is the summary of Accuracy, Sensitivity, Specificity and Precision for the training and test sets at k = 5.

Model Accuracy Sensitivity Specificity Precision
Train 0.7494039 0.804219 0.7719004 0.6716846
Test 0.6543668 0.6768582 0.6914351 0.5920369

And the results for Random Forest:

Model Accuracy Sensitivity Specificity Precision
Train 0.8208552 0.8343209 0.8528004 0.7753883
Test 0.6690154 0.6700053 0.7197958 0.6162724

The results show that the Random Forest model gives better results than kNN.

8 Pandemic (Spanish Flu)

With the novel coronavirus pandemic and its delay of the Tokyo Olympics this year, we thought it would be interesting to study a pandemic from the last century and analyze the impact it had on Olympic performance. The following countries in Europe had a combined 2.64 million excess deaths during the period when the H1N1 Pandemic (also commonly called the Spanish Flu) was circulating, from January 1918 to June 1919: Italy, Bulgaria, Portugal, Spain, Netherlands, Sweden, Germany, Switzerland, France, Norway, Denmark, and the UK (Scotland, England, Wales). In the US, 675,000 people died from H1N1, which was 0.8 percent of the 1910 population.

(Johnson, Niall P. A. S., and Juergen Mueller. “Updating the Accounts: Global Mortality of the 1918-1920 ‘Spanish’ Influenza Pandemic.” Bulletin of the History of Medicine, vol. 76, no. 1, 2002, pp. 105–115. JSTOR, www.jstor.org/stable/44446153. Accessed 19 Apr. 2020.) Taken from 1.

Of the European countries that suffered significant excess deaths during the Spanish Influenza Pandemic, these countries competed before and after 1918-1919: Denmark (DEN), France (FRA), Great Britain (GBR), Italy (ITA), Netherlands (NED), Norway (NOR), Sweden (SWE), and United States (USA). We created a separate pandemic data set containing athletes from these countries that competed in the Olympics between 1908-1928 to study before and after the pandemic.

8.1 Medals Earned

The plot shows the number of medals earned by Denmark (DEN), France (FRA), Great Britain (GBR), Italy (ITA), Netherlands (NED), Norway (NOR), Sweden (SWE), and United States (USA) before and after the pandemic. More than one Gold, Silver, or Bronze medal may be counted per sporting event, because every athlete in a team event receives a medal. Great Britain (GBR), Denmark (DEN), and Sweden (SWE) saw a decline in the number of medals their athletes earned after the pandemic. The Olympics were not held in 1916 due to World War I.

8.2 Number of Olympic Athletes

Looking at the same countries, the plot shows the number of athletes they sent to the Olympics from 1908 to 1928. Great Britain and Sweden saw a sharp decline in the number of athletes they sent to Olympic events after the pandemic. Johnson and Mueller reported a death toll of approximately 200,000 for England & Wales and 34,374 for Sweden during the 1918-1919 pandemic.

8.3 Average Age of Olympians

The chart displays the average age of Olympians from the eight countries in our data before and after the pandemic. The H1N1 influenza pandemic (“Spanish flu”) was especially fatal for individuals aged 20–40 years. The average age of Olympians competing after the pandemic increased for all countries in our data set.

8.4 Average Height and Weight of Olympians

Netherlands (NED) and Norway (NOR) had a significant increase in the average height and weight of their athletes after the Spanish flu pandemic. Sweden (SWE) and France (FRA) saw a decrease in those averages.

8.5 Total Number of Olympic Medals (Summer Events) vs. Year

8.6 Creating Time Series - Italy

During the 1918 pandemic, Italy’s death toll was approximately 390,000. According to Worldometer, there have been 29,079 deaths in Italy during the current Covid-19 pandemic. We will focus on Italy to conduct time series analysis. Can we see a pattern in historical Olympic data before and after the Spanish flu in order to predict how Italy will fare in future Olympics after it emerges from the current novel coronavirus pandemic?

The Olympics data for Italy from 1908-1928 was converted to a time series and plotted “Total Number of Medals vs Year”.

The time series shows random fluctuations over time with no overall trend. The autocorrelation function (ACF) plot does not show seasonality, periodicity, or a cyclic nature in the series. The ACF values are less than 0.05, so they are not significant; the number of medals may not be correlated with time.
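For reference, the lag-k sample autocorrelation that the ACF plot displays is the lagged covariance of the series with itself, scaled by its variance; a minimal Python version (the report's plot was produced in R):

```python
def acf(x, lag):
    # sample autocorrelation of series x at the given lag
    n = len(x)
    mu = sum(x) / n
    num = sum((x[t] - mu) * (x[t + lag] - mu) for t in range(n - lag))
    den = sum((v - mu) ** 2 for v in x)
    return num / den

print(acf([1, 2, 3, 4], 1))  # 0.25 for a short trending toy series
```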

8.7 Using 1908-1928 Time Series as Training Data

From the quick EDA, a time series may not be an appropriate method to model the number of medals Italy may earn in future Olympics. As an academic learning exercise, we will nevertheless explore different time series methodologies for forecasting.

8.8 Exploring Holt-Winters and ETS-ANN Time Series Forecasting

Holt-Winters uses exponential smoothing to make short-term forecasts. A model is designated as either additive or multiplicative.

             Length Class  Mode     
fitted       10     mts    numeric  
x             6     ts     numeric  
alpha         1     -none- numeric  
beta          1     -none- logical  
gamma         1     -none- logical  
coefficients  1     -none- numeric  
seasonal      1     -none- character
SSE           1     -none- numeric  
call          4     -none- call     

     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
2012       129.7571 80.88325 178.6310 55.01098 204.5032
2016       129.7571 67.63072 191.8835 34.74299 224.7712
2020       129.7571 56.74531 202.7689 18.09519 241.4190
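The identical point forecast at every horizon above (129.7571) is what Holt-Winters produces when the trend (beta) and seasonal (gamma) components are switched off: it reduces to simple exponential smoothing, whose final level is the flat forecast for all future horizons. A minimal Python sketch of that recursion (illustrative, not the fitted model):

```python
def ses_level(x, alpha):
    # simple exponential smoothing:
    # level_t = alpha * x_t + (1 - alpha) * level_{t-1}
    level = x[0]
    for obs in x[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level  # used as the flat forecast for every future horizon

print(ses_level([1.0, 2.0, 3.0], alpha=0.5))  # 2.25
```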

8.9 Time Series Linear Model for Italy

The plots above try to fit the number of medals with a model that would best predict the number of Olympic medals Italy will earn after the Covid-19 pandemic. Visually, the cubic spline seems the closest approximation.

8.10 Arima Model

Series: olympic_all 
ARIMA(0,0,0) with non-zero mean 

Coefficients:
          mean
      107.6000
s.e.   11.1972

sigma^2 estimated as 2015:  log likelihood=-77.83
AIC=159.66   AICc=160.66   BIC=161.07

Training set error measures:
                      ME     RMSE      MAE       MPE     MAPE      MASE
Training set 1.98952e-14 43.36635 37.70667 -18.08371 41.84652 0.3504337
                  ACF1
Training set 0.2284556
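An ARIMA(0,0,0) model with a non-zero mean has no autoregressive, differencing, or moving-average terms, so its point forecast is simply the sample mean of the series (the 107.6 reported above). Illustrated with hypothetical medal counts, not Italy's actual series:

```python
# hypothetical medal counts for illustration only
medals = [90, 110, 120, 100]
forecast = sum(medals) / len(medals)  # ARIMA(0,0,0) forecast = sample mean
print(forecast)  # 105.0
```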

The Olympic data set was divided into a training set (1908-1928), which includes the 1918-1919 Spanish influenza pandemic, and a test set (2008-2016). MAPE (Mean Absolute Percent Error), which measures the size of the error in percentage terms (taken from 2), is used to evaluate the Holt-Winters and ETS-ANN time series predictions of the medals won by Italy in 2012 and 2016. The table below shows MAPE values of less than 10%, which indicates a reasonable model.

Holt-Winters and ETS-ANN Forecast: Italy Olympic Medals
Year Actual Predicted MAPE (%)
2012 132 130 1.52
2016 144 130 9.72
2020 NA 130 NA
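The MAPE column can be checked directly from the actual and predicted values; a quick Python sketch:

```python
def mape(actual, predicted):
    # absolute percent error for a single forecast point
    return abs(actual - predicted) / actual * 100

print(round(mape(132, 130), 2))  # 1.52 (2012)
print(round(mape(144, 130), 2))  # 9.72 (2016)
```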

The ARIMA time series model predicts 107 medals for Italy in 2020. We will have to wait until the next Olympics to evaluate the accuracy of these predictions for Italy’s performance.

The Olympic medals Italy earned before and after the 1918-1919 Spanish flu pandemic, together with more recent data, were constructed into a time series at a superficial level. The resulting series could not be easily decomposed into the main components of a time series: trend, season, or irregular fluctuations. There is no hidden information to be deciphered from studying the 1918-1919 pandemic that could inform the future after Covid-19.

10 Summary and Conclusions

10.1 KMeans & KMedoids

  • On continuous categories: Age, Weight, Height, Population, GDP
    • Triathlon: clusters are statistically significant (H >= 0.77), and Population+GDP appear to add another cluster
    • Softball: clusters are statistically significant (H >= 0.77); however, Population+GDP do not appear to add another cluster in this case, and the clusters, if present, are interwoven/on top of one another

10.2 Logistic Regression

  • Difficult to model Medal: Yes/No for the entire data set
    • Modeling Medal: Yes/No is more successful when subsetting the data by Sport
    • Some sports like Softball and Baseball can be modelled very accurately with only Population and Team, whereas Swimming requires Age, Height, and Team

10.3 KNN vs Random Forest

  • Better to model Bronze/Silver/Gold than Medal Yes/No
    • Random Forest had higher accuracy, and the 4-feature model (GDPpC, Sex, Decade, Sport) had the highest accuracy

10.4 Pandemic

  • Time series analysis using Holt-Winters and ARIMA
    • Need to employ evaluation techniques on models
    • Periodic behavior observed
    • Study of the 1918-1919 Spanish influenza pandemic did not reveal any major conclusions for the current Covid-19 pandemic. World War I also coincided with the Spanish Flu.

References

1 N.P.A.S. Johnson and J. Mueller, Bulletin of the History of Medicine 76, 105 (2002).

2 E. Stellwagen, Forecasting 101: A Guide to Forecast Error Measurement Statistics and How to Use Them (n.d.).

3 R. Wood, (n.d.).